Goto

Collaborating Authors

 failure mode


CB-SLICE: Concept-Based Interpretable Error Slice Discovery

arXiv.org Machine Learning

Despite strong average-case performance, deep learning models often exhibit systematic errors on specific population groups, known as error slices. Identifying these groups and the root causes of their failures is critical for model debugging and bias mitigation. However, existing error Slice Discovery Methods (SDMs) typically generate explanations disconnected from the model's inference process, thus only approximating the underlying error source and may be inaccurate. We address this limitation by leveraging Concept Bottleneck Models (CBMs), whose predictions are directly dependent on human-understandable semantic concepts. Since downstream task failures in CBMs commonly arise from concept mispredictions, concept representations provide a strong candidate for error slice identification, offering fine-grained explanations directly linked to the error source. Building on this insight, we introduce CB-SLICE, a concept-based SDM that groups samples with shared concept prediction failures and identifies the keyword concepts most responsible for each slice's failure mode. Across multiple benchmarks, we show that CB-SLICE outperforms state-of-the-art methods in uncovering well-known biases while providing richer and more faithful explanations of model errors.


Causality as the Statistical Conscience of Artificial Intelligence: From Pearl's Ladder to Trustworthy Machines

arXiv.org Machine Learning

Modern Artificial Intelligence achieves remarkable predictive power by optimizing statistical risk functionals over vast corpora. Yet a gap separates this from genuine intelligence: the inability to distinguish correlation from causation. This paper argues that causal inference (identifying mechanisms invariant under intervention) is AI's indispensable statistical conscience. Without causal grounding, AI systems are correlation machines: powerful in familiar domains, brittle under distribution shift, and biased in high-stakes settings. Three contributions develop this argument. First, a Statistical Necessity Theorem for Causal Generalization: any algorithm achieving out-of-distribution generalization must encode causal structure, formalizing the distinction between prediction P(Y|X) and intelligence P(Y|do(X)). Second, a unified framework connects Pearl's do-calculus, the Potential Outcomes framework, Double Machine Learning, and Invariant Risk Minimization as a family of Causal Statistical Estimators, each identifying interventional distributions under different assumptions. Third, three AI failure modes (hallucination in large language models, reward hacking in reinforcement learning from human feedback, and degradation under distribution shift) are manifestations of causal blindness, each admitting a principled statistical remedy. Trustworthy AI is, at its core, a problem of causal statistics. The statistical community is not merely equipped to solve it -- it is the only community with the foundational tools to do so rigorously.


Adaptive auditing of AI systems with anytime-valid guarantees

arXiv.org Machine Learning

A major bottleneck in characterizing the failure modes of generative AI systems is the cost and time of annotation and evaluation. Consequently, adaptive testing paradigms have gained popularity, where one opportunistically decides which cases and how many to annotate based on past results. While this framework is highly practical, its extreme flexibility makes it difficult to draw statistically rigorous conclusions, as it violates classical assumptions: the number of observations is typically limited (often 10 to 50 cases) and decisions regarding sampling and stopping are made in the midst of data collection rather than based a pre-specified rule. To characterize what statistical inferences can be drawn from highly adaptive audits, we introduce a hypothesis testing framework from two 'dueling' perspectives: (i) the model's null that asserts there is no failure mode with performance below a target threshold versus (ii) the auditor's null that asserts they have a sampling strategy that will uncover a failure mode. Leveraging Safe Anytime-Valid Inference (SAVI), we formalize the auditor as conducting 'testing by betting', which translates into simultaneous e-processes for testing the dueling null hypotheses. Furthermore, if the auditor is sufficiently powerful, we prove that these two hypotheses are asymptotically inverses of each other, in that passage of a stringent audit does in fact certify the AI system as being globally robust. Empirically, we demonstrate that our proposed testing procedures maintain anytime-valid type-I error control, outperform pre-specified testing methods, and can reach statistically rigorous conclusions sometimes with as few as 20 observations.


Self-Attention as Transport: Limits of Symmetric Spectral Diagnostics

arXiv.org Machine Learning

Large language models hallucinate in predictable ways: attention routing fails by over-concentrating on a narrow set of positions, or by spreading so diffusely that relevance is diluted, and the shape of the failure carries diagnostic signal. A widely used family of spectral methods analyzes the symmetric component of the degree-normalized attention operator, which governs transport capacity; we prove that every transpose-invariant spectral diagnostic of this operator is structurally orientation-blind (it cannot distinguish an operator from its transpose, and therefore cannot detect information-flow direction), with a quantitative converse establishing the asymmetry coefficient $G$ as the unique control parameter for direction. Pairing this with a closed-form bipartite-Cheeger landscape for canonical causal architectures, we show that uniform causal attention satisfies an $n$-independent floor $ฯ•\ge 1/5$ with worst cut at $t^\ast/n \approx 0.32$, while window attention pierces the floor as $O(w/n)$; failure modes are shape-different, not just value-different. The resulting two-axis diagnostic ($ฯ•$ for capacity, $G$ for direction) yields a falsifiable polarity prediction: bottleneck- and diffuse-dominated benchmarks should exhibit opposite polarity. Under length-controlled evaluation, transport features retain interpretable signal (LC-AUROC from 0.62 to 0.84) on tested models up to 8B parameters, with polarity reversing as predicted between HaluEval and MedHallu.



Hierarchy-Guided Topology Latent Flow for Molecular Graph Generation

arXiv.org Machine Learning

Generating chemically valid 3D molecules is hindered by discrete bond topology: small local bond errors can cause global failures (valence violations, disconnections, implausible rings), especially for drug-like molecules with long-range constraints. Many unconditional 3D generators emphasize coordinates and then infer bonds or rely on post-processing, leaving topology feasibility weakly controlled. We propose Hierarchy-Guided Latent Topology Flow (HLTF), a planner-executor model that generates bond graphs with 3D coordinates, using a latent multi-scale plan for global context and a constraint-aware sampler to suppress topology-driven failures. On QM9, HLTF achieves 98.8% atom stability and 92.9% valid-and-unique, improving PoseBusters validity to 94.0% (+0.9 over the strongest reported baseline). On GEOM-DRUGS, HLTF attains 85.5%/85.0% validity/valid-unique-novel without post-processing and 92.2%/91.2% after standardized relaxation, within 0.9 points of the best post-processed baseline. Explicit topology generation also reduces "false-valid" samples that pass RDKit sanitization but fail stricter checks.